When de-registering a workspace cluster, mark any leftover running instances as stopped/failed #12912

Closed

Conversation


@jankeromnes jankeromnes commented Sep 13, 2022

Description

When de-registering a workspace cluster, mark any leftover running instances as stopped/failed in ws-manager-bridge.

Drive-by: also delete instance tokens and track workspace stops in irregular cases.

Related Issue(s)

Fixes #6770

How to test

From #6682 (comment):

  1. Register the same ws-manager as a dynamic cluster and give it a very high score, e.g. by running:

gpctl clusters get-tls-config
gpctl clusters register --name temp --url dns:///ws-manager:8080 --tls-path ./wsman-tls/
gpctl clusters update --name temp score 1000

  2. Start a workspace on that cluster.
  3. Deregister the newly added cluster - this should not work without --force, but should work with it.
  4. Your started workspace should be marked as stopped/failed, instead of remaining stuck in a non-final state like running or stopping.

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

Sorry, something went wrong.

@jankeromnes jankeromnes changed the title When deregistering a workspace cluster, mark any leftover running instances as stopped/failed When de-registering a workspace cluster, mark any leftover running instances as stopped/failed Sep 13, 2022

jankeromnes commented Sep 14, 2022

Hmm, in bridge.ts, it looks like we have two different logics/flows for marking an instance as stopped. 🤔

On status update

(view code)

case WorkspacePhase.STOPPED:
    const now = new Date().toISOString();
    instance.stoppedTime = now;
    instance.status.phase = "stopped";
    if (!instance.stoppingTime) {
        // It's possible we've never seen a stopping update, hence have not set the `stoppingTime`
        // yet. Just for this case we need to set it now.
        instance.stoppingTime = now;
    }
    lifecycleHandler = () => this.onInstanceStopped({ span }, userId, instance);

    await this.userDB.trace({ span }).deleteGitpodTokensNamedLike(ownerUserID, `${instance.id}-%`);
    this.analytics.track({
        userId: ownerUserID,
        event: "workspace_stopped",
        messageId: `bridge-wsstopped-${instance.id}`,
        properties: { instanceId: instance.id, workspaceId: instance.workspaceId },
    });

In irregular cases

(view code)

protected async markWorkspaceInstanceAsStopped(ctx: TraceContext, info: RunningWorkspaceInfo, now: Date) {
    const nowISO = now.toISOString();
    info.latestInstance.stoppingTime = nowISO;
    info.latestInstance.stoppedTime = nowISO;
    info.latestInstance.status.phase = "stopped";
    await this.workspaceDB.trace(ctx).storeInstance(info.latestInstance);
    await this.messagebus.notifyOnInstanceUpdate(ctx, info.workspace.ownerId, info.latestInstance);
    await this.prebuildUpdater.stopPrebuildInstance(ctx, info.latestInstance);
}

This is called in the following cases:

  • ws-manager doesn't know about this instance

log.info(
    { instanceId, workspaceId: instance.workspaceId },
    "Database says the instance is running, but wsman does not know about it. Marking as stopped in database.",
    { installation },
);
await this.markWorkspaceInstanceAsStopped(ctx, ri, new Date());

  • the instance timed out in preparing, building or unknown phase:

if (
    (currentPhase === "preparing" && timedOutInPreparing) ||
    (currentPhase === "building" && timedOutInBuilding) ||
    (currentPhase === "unknown" && timedOutInUnknown)
) {
    log.info(logContext, "Controller: Marking workspace instance as stopped", {
        creationTime,
        currentPhase,
    });
    await this.markWorkspaceInstanceAsStopped(ctx, info, new Date(now));
}

Differences

                              Status Update      Irregular Cases
Set stoppingTime?             Only if not set    Always
Delete Gitpod tokens?         Yes                No
Track stopped in analytics?   Yes                No

Are these differences intended? Do they make sense? 🤔

EDIT: Assuming no and resolving the differences in this drive-by commit: b2aecd0
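
For reference, a rough sketch of what aligning the irregular path could look like, re-using only the calls already shown in the two snippets above (the actual change in b2aecd0 may differ in detail):

protected async markWorkspaceInstanceAsStopped(ctx: TraceContext, info: RunningWorkspaceInfo, now: Date) {
    const nowISO = now.toISOString();
    if (!info.latestInstance.stoppingTime) {
        // Mirror the status-update path: only set `stoppingTime` if we never saw a "stopping" update.
        info.latestInstance.stoppingTime = nowISO;
    }
    info.latestInstance.stoppedTime = nowISO;
    info.latestInstance.status.phase = "stopped";
    await this.workspaceDB.trace(ctx).storeInstance(info.latestInstance);
    await this.messagebus.notifyOnInstanceUpdate(ctx, info.workspace.ownerId, info.latestInstance);
    await this.prebuildUpdater.stopPrebuildInstance(ctx, info.latestInstance);

    // Additionally do what the regular status-update path does: clean up tokens and track the stop.
    await this.userDB.trace(ctx).deleteGitpodTokensNamedLike(info.workspace.ownerId, `${info.latestInstance.id}-%`);
    this.analytics.track({
        userId: info.workspace.ownerId,
        event: "workspace_stopped",
        messageId: `bridge-wsstopped-${info.latestInstance.id}`,
        properties: { instanceId: info.latestInstance.id, workspaceId: info.latestInstance.workspaceId },
    });
}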

@roboquat roboquat added size/M and removed size/XS labels Sep 14, 2022
@jankeromnes jankeromnes marked this pull request as ready for review September 15, 2022 12:04
@jankeromnes jankeromnes requested a review from a team September 15, 2022 12:04
@github-actions github-actions bot added the team: webapp Issue belongs to the WebApp team label Sep 15, 2022

svenefftinge commented Sep 15, 2022

Re your comment: we should set stoppingTime to the same time as stoppedTime only if it wasn't set.

@geropl's comment here seems relevant for this change.

@@ -120,6 +120,7 @@ export class WorkspaceManagerBridge implements Disposable {
     }
 
     public stop() {
+        this.markAllRunningWorkspaceInstancesAsStopped();

@geropl geropl Sep 15, 2022


I'm not sure I like this approach.

So far, stopping a bridge was an operation that had absolutely no effect on workspaces. This made it a very cheap operation, and allowed for great operational flexibility. E.g., if you had to fix up a DB entry, you could always remove/re-add an entry, with very limited downsides (a delay of workspace updates for a handful of seconds). Or, if you wanted to force a reconnect, you could remove and re-add a DB entry. Now, it has a very destructive side-effect.

💭 I wonder what happens if we stop a ws-manager-bridge pod during a rollout. It would stop all workspaces on that cluster, no? @jankeromnes

Member


What do you think of having a periodic clean-up where we check and stop instances for which no ws-manager exists?

Member


Or a more general version: for all currently not-stopped instances, check them against ws-manager. This would catch a broader set of problems, but we already need to solve this problem anyway.


@geropl geropl Sep 15, 2022


What do you think of having a periodic clean-up where we check and stop instances for which no ws-manager exists?

This is something we need anyway, but with different time constraints: this PR is about unblocking Team Workspace in specific cases. The periodic clean-up would be a separate PR, but for the same issue.

For all currently not-stopped instances, check it against ws-manager.

That's a layer of abstraction above this PR/issue. tl;dr: we're already doing it; this PR is about the implementation details. Happy to provide context outside of this PR. 🙃

Contributor Author


Many thanks for the great feedback here! 💯

E.g., if you had to fixup a DB entry, you could always remove/re-add an entry, with very limited downsides (delay of workspace updates for a handful of seconds).

But why remove/re-add an entry when you can use gpctl clusters update? 🤔 (E.g. to adjust the score.)

Or, if you wanted for a reconnect, you could remove and re-add a DB entry. Now, it has a very destructive side-effect.

Can't you kill the ws-manager-bridge pod, or rollout restart its deployment, to achieve this?

💭 I wonder what happens if we stop a ws-manager-bridge pod during a rollout. It would stop all workspaces on that cluster, no? @jankeromnes

I wasn't sure, so I tested it:

  • I ran kubectl delete pod ws-manager-bridge-[tab] several times in a row
  • I also ran kubectl rollout restart deployment ws-manager-bridge a few times

My running workspace stayed alive and well all the time.

Only when I ran gpctl clusters deregister --name temp --force did it get marked as stopped.


@jankeromnes jankeromnes Sep 16, 2022


Sync feedback from @geropl: Marking all instances as stopped in this bridge.stop() lifecycle method is a change in behavior -- does it really make sense?

On the one hand (i.e. "my" view), bridge.stop() is only called:

  • here, when a cluster was de-registered (either via RPC with --force, or manually deleted from the DB) and reconcile is then called. In this case, the cluster is no longer connected, and we'll never hear again about any instances that were still running on it (so we can mark them as stopped/failed)
  • here, when we're disposing of the bridge controller (is this actually ever called? It doesn't look like it, so maybe we could delete this dead/misleading/somewhat dangerous code)

So this means that bridge.stop() is only ever called when you're actually de-registering a cluster, and instead of leaving all the instances as they are in the DB without any further updates, maybe it makes more sense to mark them as stopped/failed.

On the other hand (i.e. @geropl's view), maybe we need to be clearer or more careful about the fact that calling bridge.stop() will also mark all of its instances as stopped (i.e. this doesn't seem to be called currently when you try to reconnect/restart a bridge, but we need to make sure that it also won't be called in the future under the assumption that instances will be kept running). Maybe this can be achieved with a comment, or by renaming the stop function. We could also make a bridge mark all its instances as stopped only when we receive a deregister --force RPC, but this seems a bit more complicated (we'd need to extract/share the stopping code or somehow give the RPC handler access to the bridge), and it wouldn't handle the cases where a cluster is manually dropped from the DB.

My personal conclusion here would be to leave the PR as is, and simply add very explicit comments to stress the fact that calling bridge.stop() also marks all its instances as stopped in the DB. Or, we could even rename the method to bridge.tearDown() or bridge.markAllInstancesAsStoppedAndDispose or similar. 😊
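
To make the rename/explicit-comment option concrete, here is a minimal sketch; tearDown is just one possible name, and the dispose() call is an assumption based on the class implementing Disposable:

/**
 * Tears down this bridge after its workspace cluster has been de-registered.
 *
 * WARNING: besides releasing the bridge's resources, this marks every instance
 * that is still running on the (now unreachable) cluster as stopped/failed in
 * the DB. Do NOT call this just to reconnect or restart a bridge.
 */
public tearDown() {
    this.markAllRunningWorkspaceInstancesAsStopped();
    this.dispose(); // assumption: the existing Disposable clean-up would live here
}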

Member


What you describe makes sense to me; it just seems that the lifecycle you are attaching the effect of stopping all instances to is not well understood (I agree we should better understand it, and remove code that is not called in practice). Attaching the clean-up to the disconnect lifecycle also requires it to always be called, so that we don't leak/forget about workspaces.

It seems more general and less complicated to clean up instances for which we don't know a manager on a regular schedule instead of on disconnect events.
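
A rough sketch of what such a schedule-based clean-up could look like; all helper names and fields below are hypothetical and only illustrate the shape of the loop (the only confirmed piece is markWorkspaceInstanceAsStopped from above):

protected startOrphanedInstanceGC(intervalMs = 60_000): void {
    setInterval(async () => {
        const ctx: TraceContext = {}; // tracing omitted for brevity
        // Hypothetical helpers: all instances the DB still considers not stopped,
        // and the names of all currently registered workspace clusters.
        const notStopped = await this.listNotStoppedInstances();
        const knownClusters: Set<string> = await this.listRegisteredClusterNames();
        for (const info of notStopped) {
            // Hypothetical field: the cluster this instance was scheduled to.
            if (!knownClusters.has(info.latestInstance.region)) {
                // Re-use the existing irregular-case path quoted earlier.
                await this.markWorkspaceInstanceAsStopped(ctx, info, new Date());
            }
        }
    }, intervalMs);
}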


geropl commented Sep 19, 2022

@jankeromnes Given the mis-alignment here (which is also due to other recent PRs improving this area), I think it makes sense to ship the fixes to markWorkspaceInstanceAsStopped now, and align on the lifetime issues outside of this PR. WDYT?

@jankeromnes

Thanks for the discussions! Moving this back to Draft, and splitting out the drive-by fix into #13074 so that it can be merged earlier.


stale bot commented Sep 30, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Sep 30, 2022
@stale stale bot closed this Oct 12, 2022
@jankeromnes jankeromnes deleted the jx/gc-deregistered-ws branch December 8, 2022 16:47